A Rule Induction Approach to Modeling Regional Pronunciation Variation
Abstract
This paper describes the use of rule induction techniques for the automatic extraction of phonemic knowledge and rules from pairs of pronunciation lexica. This extracted knowledge allows the adaptation of speech processing systems to regional variants of a language. As a case study, we apply the approach to Northern Dutch and Flemish (the variant of Dutch spoken in Flanders, a part of Belgium), based on Celex and Fonilex, pronunciation lexica for Northern Dutch and Flemish, respectively. In our study, we compare two rule induction techniques, Transformation-Based Error-Driven Learning (TBEDL) (Brill, 1995) and C5.0 (Quinlan, 1993), and evaluate the extracted knowledge quantitatively (accuracy) and qualitatively (linguistic relevance of the rules). We conclude that, whereas classification-based rule induction with C5.0 is more accurate, the transformation rules learned with TBEDL can be more easily interpreted.

1. Introduction

A central component of speech processing systems is a pronunciation lexicon defining the relationship between the spelling and pronunciation of words. Regional variants of a language may differ considerably in their pronunciation. Once a speaker from a particular region is detected, speech input and output systems should be able to adapt their pronunciation lexicon to this regional variant. Regional pronunciation differences are mostly systematic and can be modeled using rules designed by experts. However, in this paper, we investigate the automation of this process by using data-driven techniques, more specifically, rule induction techniques. Data-driven methods have proven their efficacy in several language engineering tasks, such as grapheme-to-phoneme conversion, part-of-speech tagging, etc.

* This research was partially funded by the FWO project Linguaduct and the IWT project CGN (Corpus Gesproken Nederlands).
Extraction of linguistic knowledge from a sample corpus instead of manual encoding of linguistic information proved to be an extremely powerful method for overcoming the linguistic knowledge acquisition bottleneck. Different approaches are available, such as decision-tree learning (Dietterich, 1997), neural network or connectionist approaches (Sejnowski and Rosenberg, 1987), memory-based learning (Daelemans and van den Bosch, 1996), etc. Data-driven approaches can yield comparable (and often even better) results than the rule-based approach, as described in the work of Daelemans and van den Bosch (1996), in which a comparison is made between Morpa-cum-Morphon (Heemskerk and van Heuven, 1993), an example of a linguistic knowledge based approach to grapheme-to-phoneme conversion, and IG-Tree, an example of a memory-based approach (Daelemans et al., 1996).

In this study, we will look for the patterns and generalizations in the phonemic differences between Dutch and Flemish by using two data-driven techniques. It is our aim to extract the regularities that are implicitly contained in the data. Two corpora were used for this study, representing the Northern Dutch and Southern Dutch variants. For Northern Dutch, Celex (release 2) was used, and for Flemish, Fonilex (version 1.0b). The Celex database contains frequency information (based on the INL corpus of the Institute for Dutch Lexicology), and phonological, morphological, and syntactic lexical information for more than 384,000 word forms,
Similar resources
Statistical Modeling of Pronunciation Variation by Hierarchical Grouping Rule Inference
In this paper, a data-driven approach to the statistical modeling of pronunciation variation is proposed. It consists of learning stochastic pronunciation rules. The proposed method jointly models different rules that define the same transformation. The Hierarchic Grouping Rule Inference (HIEGRI) algorithm is proposed to generate this model based on graphs. The HIEGRI algorithm detects the common patterns of ...
Modeling Pronunciation Variation for ASR: Comparing Criteria for Rule Selection
In this paper we use a data-driven (DD) rule-based method for modeling pronunciation variation. Error analysis is performed in order to gain insight into the effect of pronunciation variation modeling. This analysis shows that although modeling pronunciation variation brings about improvements, deteriorations are also introduced. A strong correlation is found between the number of improvements ...
Improving the Performance of a Dutch CSR by Modeling Pronunciation Variation
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods in order to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, cross-word pronunciatio...
A data-driven method for modeling pronunciation variation
This paper describes a rule-based data-driven (DD) method to model pronunciation variation in automatic speech recognition (ASR). The DD method consists of the following steps. First, the possible pronunciation variants are generated by making each phone in the canonical transcription of the word optional. Next, forced recognition is performed in order to determine which variant best matches th...
Modeling Cross-morpheme Pro for Korean Large Vocabulary Cont
In this paper, we describe a cross-morpheme pronunciation variation model which is especially useful for constructing morpheme-based pronunciation lexicon for Korean LVCSR. There are a lot of pronunciation variations occurring at morpheme boundaries in continuous speech. Since phonemic context together with morphological category and morpheme boundary information affect Korean pronunciation var...
Developing consistent pronunciation models for phonemic variants
Pronunciation lexicons often contain pronunciation variants. This can create two problems: It can be difficult to define these variants in an internally consistent way and it can also be difficult to extract generalised grapheme-to-phoneme rule sets from a lexicon containing variants. In this paper we address both these issues by creating ‘pseudo-phonemes’ associated with sets of ‘generation re...